10.4 Probability Distributions: Non-Bell-Shaped
Not every probability distribution has a bell-shaped curve. As with all distributions, the shape of the curve depends on the distribution's parameters. In this section, we review two distributions whose curves are often not bell-shaped.
Poisson Distribution
The Poisson distribution is also a discrete probability distribution, taking only integer values. The Poisson distribution adds the dimension of time or space to the frequency distribution. It tells us the distribution of events per unit of time (or space) when we know the average number of events and we sample many units. The classic examples are cars arriving at a toll booth per hour, visitors arriving at a website per minute, or typos per 50 lines of code.
These types of processes are called Poisson processes. A Poisson process is a model for a series of discrete events where the average time between events can be measured, but the arrival time is random. The Poisson distribution calculates the probabilities for different numbers of events occurring during the time period.
While we may calculate the average number of cars arriving at a toll booth each minute, we might also want to know how different this is from one minute to another. A business question this could answer would be, “How much capacity do we need to be 95% sure of fully processing automobile toll booth traffic that arrives in any five-minute period?” The average rate per unit of time at which events occur is called lambda. The assumption is that lambda remains constant, or that the events occur at a constant rate and independently of the time since the last event.
Figure 10.19 illustrates several probability mass functions for different average rates. The horizontal axis, k, is the number of occurrences per unit of time, given an average number of occurrences. That average is lambda. Because this is a discrete distribution, probability values occur only at the integers. In the figure, we have connected the dots to make it easier to see the flow of the event probabilities.
For example, with a lambda of 4, meaning an average value of 4, we see that the probability of exactly 4 occurrences is about 0.20. The probability of 3 occurrences is also about 0.20. Other counts, such as 2 or 1, or 6 or 7, are less likely, and the probabilities for higher numbers of occurrences rapidly approach zero. Figure 10.19 also shows the cumulative probabilities. Notice for a lambda of 4 that the probability of 0 through 4 events, that is, of at most 4 events occurring, is over 60%.
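For readers who want the formula behind these values, the Poisson probability of exactly k events when the average is lambda is P(k) = (lambda^k × e^(−lambda)) / k!. With lambda = 4 and k = 4, this gives (256 × 0.0183) / 24 ≈ 0.195, matching the roughly 0.20 read from the figure.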
Excel has a function that returns the probability of a given number of events given the mean value:
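In Excel 2010 and later this is POISSON.DIST(x, mean, cumulative), where x is the number of events, mean is lambda, and cumulative is FALSE for the probability of exactly x events or TRUE for the probability of x or fewer events. For the lambda = 4 example above:
=POISSON.DIST(4, 4, FALSE) returns about 0.195
=POISSON.DIST(4, 4, TRUE) returns about 0.629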
For example, a call center receives an average of 180 calls per hour, 24 hours a day. The calls are independent; receiving one does not change the probability of when the next one will arrive. The average calls per minute is simply the calls per hour divided by 60, giving a lambda of 3. The number of calls received during any minute then has a Poisson probability distribution: the most likely counts are 2 and 3 (each with probability about 0.22), followed by 4, then 1 and 5; there is a small probability of a minute with no calls at all and a very small probability of one with as many as 10. Figure 10.20 illustrates the Poisson distribution for the call center.
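These values can be checked with the POISSON.DIST function described above:
=POISSON.DIST(2, 3, FALSE) and =POISSON.DIST(3, 3, FALSE) each return about 0.224
=POISSON.DIST(4, 3, FALSE) returns about 0.168
=POISSON.DIST(0, 3, FALSE) returns about 0.050
=POISSON.DIST(10, 3, FALSE) returns about 0.0008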
Figure 10.21 shows several more examples of using the Poisson distribution for business decisions.
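As a sketch of how the capacity question posed earlier might be answered, suppose, purely for illustration, that an average of 12 cars arrive at the toll booth per five-minute period (lambda = 12). We look for the smallest capacity C for which the cumulative probability P(X ≤ C) is at least 0.95. =POISSON.DIST(17, 12, TRUE) returns about 0.937, while =POISSON.DIST(18, 12, TRUE) returns about 0.963, so the booth would need the capacity to process 18 cars in any five-minute period to be 95% sure of fully handling the arriving traffic.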
Exponential Distribution
The exponential distribution is closely related to the Poisson distribution. The exponential distribution predicts the length of the time period between events in a Poisson process. Stated another way, the exponential distribution is commonly used to predict how long we expect to wait until the next event occurs. Unlike the Poisson distribution, the exponential distribution is a continuous distribution.
Exponential distributions are often used to calculate product reliability, that is, to predict the length of time a product will last. In other words, they are used to find when the next (first) breakdown will occur. Other Poisson processes with exponentially distributed waiting times include radioactive decay, mutations of DNA, and the arrival of the next customer. It is interesting that such things as the length in minutes of long-distance business calls, the amount of change a random person has in his or her pocket, the time a postal clerk spends with a customer, and the amount of money a person spends on a shopping trip can all be modeled using the exponential distribution.
The important physical property to express an exponential distribution is the average time between occurrences. The inverse of the average is called lambda ( λ ), and it is lambda that is used to describe the exponential distribution. Lambda is called the rate parameter of the variable being modeled. So, for example, if the average time for an event is 4 minutes, then the rate parameter is 1/4. Figure 10.22 illustrates the shape of the curve for various values of lambda.
Without going into the mathematics of the exponential distribution, you should notice that when x = 0, the height of the distribution curve is equal to lambda. Thus, when lambda is greater than 1, the probability density function takes values greater than 1. Wait a minute!! We thought probabilities always ranged between 0 and 1. The resolution is that the height of the curve is a density, not a probability; probabilities correspond to areas under the curve, and the total area under the curve is exactly 1, so the cumulative probability never exceeds 1. And, as we will see in our example, the questions we are able to answer do use the cumulative probability.
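For readers who want the formulas behind the curves in Figure 10.22: the exponential density is f(x) = lambda × e^(−lambda × x) for x ≥ 0, which indeed equals lambda at x = 0, and the cumulative probability of an event occurring within time x is F(x) = 1 − e^(−lambda × x), which approaches but never exceeds 1.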
The Excel function that gives the probability is similar to the other functions:
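In current versions of Excel this is EXPON.DIST(x, lambda, cumulative), where x is the time, lambda is the rate parameter, and cumulative is FALSE for the density or TRUE for the cumulative probability. For example, =EXPON.DIST(4, 0.25, TRUE) returns about 0.632, the probability that an event occurs within 4 minutes when the average time between events is 4 minutes.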
As an example, let’s assume a postal worker spends on average about 4 minutes with each customer. This gives a lambda of 0.25. Figure 10.23 shows both the probability distribution and the cumulative distribution for λ = 0.25.
Now let’s find the probability that the clerk spends from four to five minutes with a randomly selected customer. This probability is the area under the probability distribution curve up to x = 5 minus the area up to x = 4, and those areas are given by the cumulative values. From the previous figure, we take the cumulative value for x = 5, which is 0.7135, and subtract the cumulative value for x = 4, which is 0.6321, to get a probability of 0.0814. Figure 10.24 illustrates this example using calculations, although we use the Excel function in the table in the previous figure.
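In Excel terms, using the EXPON.DIST function shown above, the same answer comes from
=EXPON.DIST(5, 0.25, TRUE) - EXPON.DIST(4, 0.25, TRUE)
which returns approximately 0.7135 - 0.6321 = 0.0814.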
Resampling Methods: The Bootstrap
With modern computing power, statistics has moved from the traditional approximated sampling distributions (as we have done previously) to resampling methods such as the bootstrap to determine the possible error in sample estimates. The purpose of the bootstrap is to test how well our sample data actually represents the population. To do this, we repeatedly resample from our original sample and observe statistics such as the mean, standard deviation, and IQR of each resample. We can then take the average of the means, standard deviations, and IQRs and run tests on those.
To do the bootstrap, we draw additional random samples, with replacement, from the original sample and recalculate the statistic or model for each resample. The advantage of this method is that it makes no assumption that the sample statistic is normally distributed. The only assumption is that our sample is a valid representative of the population. The bootstrap tells us how lots of additional samples would behave if they were drawn from a population like our original sample, so we can assess the variability of a sample statistic and construct hypothesis tests. In predictive analytics, aggregating the predictions from multiple bootstrap samples is called bagging, and it often outperforms the prediction of a single model. The bootstrap is especially useful when we have a small sample, when we are unsure of the population's theoretical distribution, or when the sample comes from an unknown distribution.
Each sample gives one estimate of a statistic, such as the mean, but a single sample does not tell us the distribution of that estimate. The bootstrap, with its multiple resamples with replacement, gives more detail on the probability distribution of the mean, from which we can estimate confidence intervals, variance, and so forth.
Figure 10.25 illustrates an original sample of 20 elements on row 4. Beginning on row 8, we take 100 more samples of 20 elements each. In columns W, X, and Y are the mean, standard deviation, and IQR for each of the resamples.
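One way to build such a resampling table in Excel, assuming as a hypothetical layout that the 20 original values occupy cells C4:V4 and the first resample occupies C8:V8, is to fill every resample cell with
=INDEX($C$4:$V$4, RANDBETWEEN(1, 20))
which picks one of the 20 original values at random, with replacement. The summary statistics for the first resample could then be
=AVERAGE(C8:V8) for the mean,
=STDEV.S(C8:V8) for the standard deviation, and
=QUARTILE.INC(C8:V8, 3) - QUARTILE.INC(C8:V8, 1) for the IQR.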
In Figure 10.26 we have constructed a histogram of the data with a bin size of 10 units. The figure also illustrates the confidence interval of the bootstrap mean.
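A common way to obtain such a bootstrap confidence interval, assuming the 100 resample means sit in the hypothetical range W8:W107, is to take percentiles of the resample means:
=PERCENTILE.INC(W8:W107, 0.025) for the lower bound and
=PERCENTILE.INC(W8:W107, 0.975) for the upper bound,
which together bracket the middle 95% of the bootstrap means.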